Skip to content

feat(graphrag,mcp,vectordb,storage): tenant isolation rollup (RAN-37, RAN-38, RAN-39, RAN-21, RAN-20)#28

Closed
aksOps wants to merge 5 commits into
ran37-graphrag-tenant-partitioned-storesfrom
ran38-graphrag-tenant-persistence
Closed

feat(graphrag,mcp,vectordb,storage): tenant isolation rollup (RAN-37, RAN-38, RAN-39, RAN-21, RAN-20)#28
aksOps wants to merge 5 commits into
ran37-graphrag-tenant-partitioned-storesfrom
ran38-graphrag-tenant-persistence

Conversation

@aksOps

@aksOps aksOps commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Multi-tenancy isolation across GraphRAG persistence + in-memory + vector index + trace identity, plus tenant-scoped MCP tools. Implements RAN-37, RAN-38, RAN-39, RAN-21, and RAN-20 on a single branch because they share a common code surface and were sequenced as a unit. Each issue is documented inline below; commits are atomic per issue.

Issues addressed

  • RAN-37 — GraphRAG in-memory store partitioning (ServiceStore, TraceStore, SignalStore, AnomalyStore keyed by tenant; query ctx threading).
  • RAN-38tenant_id column + scoped reads + idempotent backfill on the three persisted GraphRAG tables (investigations, graph_snapshots, drain_templates).
  • RAN-39 — MCP request-tenant ctx wiring through GraphRAG handlers + merge-gate two-tenant integration test.
  • RAN-21 — tenant-scoped trace_id uniqueness so the same trace ID can appear across tenants without collision.
  • RAN-20 — semantic log search tenant isolation: vectordb.Index filter-on-read + tenant-aware FIFO eviction; cross-tenant startup hydration helper; HTTP + MCP integration tests including FIFO eviction-fairness assertion.

RAN-38 — GraphRAG persistence

  • Schema: investigations and graph_snapshots get (tenant_id, created_at) composite indexes. drain_templates PK is promoted to composite (tenant_id, id) so the same Drain template hash can coexist across tenants once a per-tenant miner ships.
  • Reads: GetInvestigations, GetInvestigation, and GetGraphSnapshot now take a context.Context and filter by storage.TenantFromContext. The MCP dispatcher already had a ctx in scope and threads it through; cross-tenant id-guess lookups return ErrRecordNotFound.
  • Writes: PersistInvestigation writes its tenant onto the row; takeSnapshotForTenant does the same; SaveDrainTemplates/LoadDrainTemplates now take a tenant parameter and target the composite PK in OnConflict.
  • Cooldown: investigationCooldown's key includes tenant so an error in tenant-A doesn't suppress the same pattern in tenant-B.
  • Migration: AutoMigrateGraphRAG runs three passes — AutoMigrate → idempotent UPDATE … WHERE tenant_id IS NULL OR '' backfill across all three tables → driver-specific drain_templates PK promotion. SQLite uses the canonical rename/recreate/copy/drop recipe; PostgreSQL uses ALTER TABLE … DROP CONSTRAINT IF EXISTS … ADD PRIMARY KEY (tenant_id, id). MySQL/MSSQL skip with a logged warn so an operator can apply equivalent DDL by hand.

RAN-20 — semantic log search tenant isolation

  • vectordb.Index now records Tenant per LogVector and filters on read in Search(tenant, q, k). Empty tenant coerces to platform default (mirrors storage.DefaultTenantID).
  • Tenant-aware FIFO eviction: when at cap, Add drops up to maxSize/10 oldest entries belonging to the inserting tenant only — a noisy tenant can no longer push another tenant's warm rows out. Brand-new tenant on a full index drops at most one globally-oldest row as a one-time fairness fallback.
  • Cross-tenant startup hydration: new admin-scoped repo.ListRecentHighSeverityLogsAllTenants (cross-tenant by design, not exposed on any tenant-scoped API surface). Closes the silent failure where GetLogsV2-based hydration only repopulated the default tenant's index on restart.
  • Surfaces fixed: internal/api/similar_handler.go resolves tenant via storage.TenantFromContext(r.Context()); internal/mcp/tools.go::toolFindSimilarLogs resolves via mcpCtx(ctx) + TenantFromContext.
  • Tests: internal/vectordb/index_test.go (5 tests incl. TestFIFOEvictionFairness), internal/api/similar_handler_test.go (HTTP two-tenant integration via real TenantMiddleware), internal/mcp/tools_ran20_test.go (MCP two-tenant + no-tenant-fallback).

RAN-21 — trace_id tenant scoping

Promotes the traces.trace_id unique constraint to (tenant_id, trace_id) so concurrent tenants can ingest overlapping trace IDs without collision. Migration handles SQLite, Postgres; warns on MySQL/MSSQL.

RAN-39 — MCP tenant ctx + isolation test

Threads r.Context() end-to-end through every GraphRAG-backed MCP tool (get_error_chains, trace_graph, impact_analysis, root_cause_analysis, correlated_signals, get_anomaly_timeline, get_service_map, get_investigations, get_investigation, get_graph_snapshot). Adds internal/mcp/tenant_isolation_test.go as the merge-gate two-tenant integration test.

RAN-37 — store partitioning

Introduces storesFor(ctx) — per-tenant ServiceStore/TraceStore/SignalStore/AnomalyStore resolution. All query helpers take ctx; OnSpan/Log/MetricIngested fan into the right tenant's stores via the ingest tenant resolution chain.

Out of scope

  • Per-tenant Drain miner instance — schema is ready, g.drain stays single-instance keyed under DefaultTenantID. Tracked separately.
  • Generic TenantPartitioned[K, V] abstraction — TechLead and I converged on ad-hoc maps for this PR; extraction to a follow-up if shapes converge cleanly post-merge.

Test plan

  • go vet ./... clean
  • go build ./... clean
  • go test ./... -count=1 — full suite passes
  • go test -race -count=1 ./internal/{mcp,graphrag,vectordb,api,storage}/... clean
  • RAN-38 migration tests (composite indexes, drain PK, idempotent re-run, legacy backfill)
  • RAN-37 store partitioning + cross-tenant query isolation
  • RAN-39 merge-gate tenant_isolation_test.go
  • RAN-21 trace-id collision across tenants
  • RAN-20 TestFIFOEvictionFairness (tenant B flood does not evict tenant A canaries) + HTTP/MCP two-tenant integration

🤖 Generated with Claude Code

Adds tenant_id to the three persisted GraphRAG tables (investigations,
graph_snapshots, drain_templates), with a one-shot idempotent backfill so
existing rows fall back to DefaultTenantID and a driver-specific composite
PK promotion for drain_templates on SQLite/PostgreSQL.

- schema.go: TenantID first on DrainTemplateRow; PK becomes (tenant_id, id).
- investigation.go: TenantID + idx_investigations_tenant_created composite
  index; cooldown key now scoped by tenant; GetInvestigations / Get-by-id
  take a context and filter by storage.TenantFromContext.
- snapshot.go: TenantID + idx_graph_snapshots_tenant_created;
  GetGraphSnapshot is ctx-scoped. takeSnapshot continues to fan out per
  tenant.
- drain.go: SaveDrainTemplates / LoadDrainTemplates now take a tenant
  string; OnConflict targets composite (tenant_id, id) so the same
  template hash can coexist across tenants for the eventual per-tenant
  Drain miner.
- migrate.go (new): AutoMigrateGraphRAG runs AutoMigrate, idempotent
  UPDATE backfill, and a SQLite (rename/recreate/copy) + PostgreSQL
  (DROP/ADD CONSTRAINT) PK promotion path. MySQL/MSSQL skipped with a log.
- mcp/tools.go: thread the dispatcher ctx into get_investigations,
  get_investigation, get_graph_snapshot so the new ctx-aware signatures
  compile end-to-end. Subtask C (RAN-39) wires the request tenant.
- builder.go / refresh.go: Drain Save/Load callers pin
  storage.DefaultTenantID for the single shared miner.

Tests:
- TestAutoMigrateGraphRAG_CreatesTenantCompositeIndexes
- TestAutoMigrateGraphRAG_DrainTemplatesCompositePK
- TestAutoMigrateGraphRAG_IsIdempotent
- TestAutoMigrateGraphRAG_BackfillsLegacyRows
- TestSaveLoadDrainTemplates_TenantIsolation (collision across tenants)
- TestGraphRAG_GetInvestigations_TenantScoped (cross-tenant id-guess miss)
- TestGraphRAG_GetGraphSnapshot_TenantScoped
- TestCooldownKey_TenantIsolated

Verification: go vet ./... and go test ./... — 281 tests pass across 26
packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62768be8ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +141 to +145
if err := tx.Exec(`ALTER TABLE drain_templates RENAME TO drain_templates__legacy`).Error; err != nil {
return fmt.Errorf("rename legacy: %w", err)
}
// Recreate with the current GORM schema (composite PK + indexes).
if err := tx.Migrator().CreateTable(&DrainTemplateRow{}); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Drop legacy indexes before creating new drain_templates table

On SQLite upgrades from the pre-RAN-38 schema, ALTER TABLE ... RENAME TO drain_templates__legacy keeps legacy index names (for first_seen/last_seen) attached to the renamed table. Creating the new table immediately afterward tries to create indexes with the same names and fails with index ... already exists, so PK promotion never completes; because writes now use ON CONFLICT (tenant_id, id), subsequent SaveDrainTemplates calls fail on those upgraded databases.

Useful? React with 👍 / 👎.

aksOps and others added 3 commits April 24, 2026 21:01
…n test (RAN-39)

Threads the tenant resolved by the MCP transport (X-Tenant-ID header → ctx)
into every GraphRAG-backed tool handler and adds the merge-gate integration
test that asserts cross-tenant isolation for the full GraphRAG-backed MCP
surface.

mcp/tools.go
- get_system_graph and get_service_health now accept ctx and route through
  GraphRAG.ServiceMap / AllServiceEdges so they pick up RAN-37's per-tenant
  in-memory partitioning. Legacy svcGraph remains as a fallback path only
  when GraphRAG isn't wired (boot windows, future test harnesses).
- All other GraphRAG handlers were already ctx-threaded after RAN-37/38;
  no behavior change for those.

vectordb/index.go (+ main.go, api/similar_handler.go)
- vectordb.Index.Add and Search now take a tenant string; LogVector and
  SearchResult carry the tenant tag and Search filters by it. RAN-37
  already added tenant args at the call sites in graphrag/clustering.go
  and mcp/tools.go but the matching vectordb signature change had not
  landed, leaving the branch unbuildable. This closes that gap with the
  smallest surgical change; the broader vectordb rework remains RAN-20.
- main.go now passes l.TenantID into vectorIdx.Add on hydration and the
  live ingest hook; api/similar_handler resolves tenant from the request
  context before searching.

graphrag/builder.go
- New RegisterAnomaly(tenant, AnomalyNode) — small public API symmetric
  with PersistInvestigation, used by the new isolation test to seed
  per-tenant anomalies without depending on the throttled detector loop.

mcp/tenant_isolation_test.go
- Stands up an in-process MCP server (httptest) wired to GraphRAG over
  in-memory SQLite, seeds three tenants (acme, beta, default) with
  overlapping service_name / trace_id / span_id / Drain template / log
  body / snapshot, and exercises every GraphRAG-backed tool — get_service_map,
  get_service_health, get_error_chains, trace_graph, impact_analysis,
  root_cause_analysis, correlated_signals, get_anomaly_timeline,
  get_investigations, get_investigation (own + cross-tenant id-guess),
  get_graph_snapshot, find_similar_logs, get_system_graph — three times
  each (X-Tenant-ID acme, X-Tenant-ID beta, no header → DefaultTenantID).
  Each response is scanned for the caller's own tenant marker and for any
  other seeded tenant's marker (service name, log body, op name, anomaly
  evidence, snapshot id) to prove no cross-tenant leak.

Verified: go vet ./... clean; go test ./... clean; go test -race
./internal/{mcp,graphrag}/... clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Replace standalone uniqueIndex on Trace.TraceID with composite
idx_traces_tenant_trace_id(tenant_id, trace_id) so two tenants can
ingest the same trace_id without either dropping into OnConflict.DoNothing.
AutoMigrateModels now idempotently discovers and drops any legacy
single-column unique index on traces.trace_id (SQLite/Postgres/MySQL/MSSQL)
so upgraded databases stop silently blocking cross-tenant reuse.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…isolation

Reviewer (cf5145d8) requested three changes on commit c839460. Addresses
each with verified failure-on-regression checks.

#1 — Cross-tenant boot hydration of vector index
Previously main.go hydrated the index via repo.GetLogsV2(appCtx, ...) which
is tenant-scoped to whatever tenant ctx carries — appCtx has none, so only
the default tenant's rows reloaded after restart and find_similar_logs was
cold for every other tenant until fresh ERROR logs landed. The new
tenant-aware vectorIdx.Add(..., l.TenantID, ...) didn't fix it because the
non-default rows were never fetched.

- internal/storage/log_repo.go: new ListRecentHighSeverityLogsAllTenants
  — explicitly cross-tenant administrative read used only by hydration.
  Each row carries its own TenantID; fan-out happens in the caller.
- main.go: hydration now uses the new method so every tenant's warm
  index survives a restart.
- internal/vectordb/index.go: tenant-aware FIFO eviction. At cap, drop
  up to maxSize/10 of the inserting tenant's oldest rows so a noisy
  tenant cannot evict another tenant's warm rows (availability isolation;
  confidentiality is still enforced by doc.Tenant filtering in Search).
  Brand-new tenants drop one globally-oldest row to claim a first slot.

#2 — root_cause_analysis assertion was vacuous
internal/mcp/tenant_isolation_test.go:434 passed an empty ownMarker to
assertNoLeak, so the merge gate would still pass if the tool regressed
to returning [] for every tenant. Now passes ownService (RankedCause
carries Service so it must appear in a non-empty response). Verified by
sabotaging the handler to return a nil slice — the assertion fails with
'expected own marker "acme-orders" in response, body=null' as designed,
then reverted.

#3 — Drain cluster-id test didn't actually compare cluster IDs
The previous test reused seedTenant (per-tenant service names) and only
scanned response text, so a regression that surfaced the same cluster row
across tenants would still pass. Rewritten to:

- Use one shared service name across both tenants so Drain produces
  colliding (service, templateID) keys — the SignalStore partition is the
  only thing keeping rows separate.
- Inspect the actual []graphrag.LogClusterNode returned by
  CorrelatedSignals (not just response text), checking Template +
  SampleLog content for own-marker presence and foreign-marker absence.
- Log the per-tenant cluster IDs so future refactors that change the ID
  scheme leave a visible audit trail.
- End-to-end probe via the MCP HTTP surface remains, asserting the same
  isolation reaches clients.

RAN-20 supporting tests
internal/vectordb/index_test.go, internal/api/similar_handler_test.go,
internal/mcp/tools_ran20_test.go cover vectordb tenant scoping at three
layers (in-memory, REST handler, MCP tool). They were sitting untracked
on the branch from the parallel RAN-20 work; bundling them so the
follow-up vectordb behavior added here is covered.

Verified:
- go vet ./... clean
- go test ./... clean (full repo)
- go test -race ./internal/{mcp,graphrag,vectordb,api}/... clean

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@aksOps aksOps changed the title feat(graphrag): tenant_id column + scoped reads + backfill (RAN-38) feat(graphrag,mcp,vectordb): multi-tenancy isolation — RAN-37/38/39 + RAN-21 + RAN-20 Apr 25, 2026
Tightens the docstring to spell out that results are scoped to the
tenant on r.Context() (set by TenantMiddleware from X-Tenant-ID) and
cross-tenant rows are never returned. Behavior unchanged.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@sonarqubecloud

Copy link
Copy Markdown

@aksOps aksOps changed the title feat(graphrag,mcp,vectordb): multi-tenancy isolation — RAN-37/38/39 + RAN-21 + RAN-20 feat(graphrag,mcp,vectordb,storage): tenant isolation rollup (RAN-37, RAN-38, RAN-39, RAN-21, RAN-20) Apr 25, 2026
@aksOps aksOps deleted the branch ran37-graphrag-tenant-partitioned-stores April 25, 2026 16:00
@aksOps aksOps closed this Apr 25, 2026
@aksOps aksOps deleted the ran38-graphrag-tenant-persistence branch April 26, 2026 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant